The Tidyverse is a collection of R packages for data science that share a common design philosophy, grammar, and data structures.
Its core packages can be installed with a single command and loaded with another.
Together they cover the tasks that come up in everyday data analysis.
Consistency: Consistent syntax and data structures across packages.
Ease of Use: Simple, intuitive functions.
Performance: Efficient and optimized for modern data science.
Versatility: Suited to a wide range of tasks, from data manipulation to visualization.
Comprehensive: Covers all stages of a data science workflow.
ggplot2
: Data visualization.
dplyr
: Data manipulation.
tidyr
: Data tidying.
readr
: Data input.
purrr
: Functional programming.
tibble
: Data frames.
stringr
: String manipulation.
forcats
: Factor handling.
lubridate
: Date and time manipulation.
To install Tidyverse: install.packages("tidyverse")
To load Tidyverse into your current R session: library(tidyverse)
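The pipe operator, either %>% (from the magrittr package, available once the Tidyverse is loaded) or the native |> (built into R since version 4.1), passes the result of one function directly to the next function in a sequence, making your code more readable and intuitive.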
Pipes allow you to write operations sequentially, enhancing clarity and reducing complexity.
Syntax:
data %>% function1() %>% function2() %>% function3()
data: The input data is passed through the pipe.
function1(), function2(), function3(): These functions operate on the data in sequence, with the output of one function feeding directly into the next.
In RStudio, the keyboard shortcut for the pipe operator is Ctrl/Cmd + Shift + M.
As an example, we simulate a small sample of data: using the function sample(), we randomly draw (without replacement) 5 numbers between 1 and 20 (1:20) and compute the log transformation of the resulting vector.
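The code for this example is not shown, so here is a minimal sketch of how it might look. The seed and the variable names are assumptions added for reproducibility and illustration.

```r
library(magrittr)  # provides %>%; it is also available after library(tidyverse)

set.seed(123)  # assumed seed, only so that the random draw is reproducible

# Draw 5 numbers from 1:20 without replacement (the default for sample())
x <- sample(1:20, size = 5)

# Nested call: read from the inside out
log_nested <- log(x)

# Piped calls: read from left to right
log_magrittr <- x %>% log()
log_native <- x |> log()  # native pipe, requires R >= 4.1

# All three approaches return the same vector
identical(log_nested, log_magrittr)
identical(log_nested, log_native)
```

The piped versions state the input first and the transformation second, which is the reading order the pipe is meant to encourage.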
dplyr
dplyr is designed for easy data manipulation using verbs that describe the operations you want to perform. The most common functions are:
filter()
: Select rows based on conditions.
select()
: Choose columns to keep.
mutate()
: Add new columns or modify existing ones.
arrange()
: Sort the data.
summarize()
: Reduce multiple values down to a single summary.
group_by()
: Group rows so that subsequent operations are applied per group.
dplyr verbs
The examples use the diamonds dataset (shipped with ggplot2), which contains the prices and other attributes of almost 54,000 diamonds (see ?diamonds).
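The calls that produced the output below are not shown. It is consistent with inspecting the dataset in three ways, sketched here on the assumption that class(), str(), and glimpse() were used in that order.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides glimpse()

class(diamonds)    # class vector of the object
str(diamonds)      # base R summary of the structure
glimpse(diamonds)  # compact column-by-column preview
```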
[1] "tbl_df" "tbl" "data.frame"
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Rows: 53,940
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
select()
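The calls behind the outputs below are not shown. The following sketch lists select() calls that would return tables of the same shape, in the same order. The positional form in the third call is an assumption; it is only one of several ways to obtain the same four columns.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

diamonds %>% select(carat)                     # a single column
diamonds %>% select(carat, cut, color, price)  # several columns by name
diamonds %>% select(1:3, 7)                    # the same columns by position
diamonds %>% select(carat:clarity)             # a range of adjacent columns
diamonds %>% select(-carat)                    # everything except carat
diamonds %>% select(-(carat:clarity))          # drop a range of columns
```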
# A tibble: 53,940 × 1
carat
<dbl>
1 0.23
2 0.21
3 0.23
4 0.29
5 0.31
6 0.24
7 0.24
8 0.26
9 0.22
10 0.23
# ℹ 53,930 more rows
# A tibble: 53,940 × 4
carat cut color price
<dbl> <ord> <ord> <int>
1 0.23 Ideal E 326
2 0.21 Premium E 326
3 0.23 Good E 327
4 0.29 Premium I 334
5 0.31 Good J 335
6 0.24 Very Good J 336
7 0.24 Very Good I 336
8 0.26 Very Good H 337
9 0.22 Fair E 337
10 0.23 Very Good H 338
# ℹ 53,930 more rows
# A tibble: 53,940 × 4
carat cut color price
<dbl> <ord> <ord> <int>
1 0.23 Ideal E 326
2 0.21 Premium E 326
3 0.23 Good E 327
4 0.29 Premium I 334
5 0.31 Good J 335
6 0.24 Very Good J 336
7 0.24 Very Good I 336
8 0.26 Very Good H 337
9 0.22 Fair E 337
10 0.23 Very Good H 338
# ℹ 53,930 more rows
# A tibble: 53,940 × 4
carat cut color clarity
<dbl> <ord> <ord> <ord>
1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0.22 Fair E VS2
10 0.23 Very Good H VS1
# ℹ 53,930 more rows
# A tibble: 53,940 × 9
cut color clarity depth table price x y z
<ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
# A tibble: 53,940 × 6
depth table price x y z
<dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 61.5 55 326 3.95 3.98 2.43
2 59.8 61 326 3.89 3.84 2.31
3 56.9 65 327 4.05 4.07 2.31
4 62.4 58 334 4.2 4.23 2.63
5 63.3 58 335 4.34 4.35 2.75
6 62.8 57 336 3.94 3.96 2.48
7 62.3 57 336 3.95 3.98 2.47
8 61.9 55 337 4.07 4.11 2.53
9 65.1 61 337 3.87 3.78 2.49
10 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
filter()
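The two outputs below contain only Premium diamonds, and then only Premium diamonds of color D. A sketch of filter() calls that would produce such results, assuming these were the conditions used:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Keep rows where cut is "Premium"
premium <- diamonds %>% filter(cut == "Premium")
premium

# Conditions separated by commas are combined with a logical AND
premium_d <- diamonds %>% filter(cut == "Premium", color == "D")
premium_d
```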
# A tibble: 13,791 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
3 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
4 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
5 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
6 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47
7 0.29 Premium F SI1 62.4 58 403 4.24 4.26 2.65
8 0.22 Premium E VS2 61.6 58 404 3.93 3.89 2.41
9 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
10 0.3 Premium J SI2 59.3 61 405 4.43 4.38 2.61
# ℹ 13,781 more rows
# A tibble: 1,603 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
2 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
3 0.71 Premium D SI2 61.7 59 2768 5.71 5.67 3.51
4 0.71 Premium D VS2 62.5 60 2770 5.65 5.61 3.52
5 0.7 Premium D VS2 58 62 2773 5.87 5.78 3.38
6 0.72 Premium D SI1 62.7 59 2782 5.73 5.69 3.58
7 0.7 Premium D SI1 62.8 60 2782 5.68 5.66 3.56
8 0.72 Premium D SI2 62 60 2795 5.73 5.69 3.54
9 0.71 Premium D SI1 62.7 60 2797 5.67 5.71 3.57
10 0.71 Premium D SI1 61.3 58 2797 5.73 5.75 3.52
# ℹ 1,593 more rows
summarise()
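The column names in the output below suggest that the summaries were left unnamed. A sketch of such a call, followed by a variant with explicit names (the names mean_price and median_price are illustrative assumptions):

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Unnamed summaries: the columns are labelled with the expressions themselves
price_summary <- diamonds %>% summarise(mean(price), median(price))
price_summary

# Named summaries are easier to reuse in later steps
price_summary_named <- diamonds %>%
  summarise(mean_price = mean(price), median_price = median(price))
price_summary_named
```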
# A tibble: 1 × 2
`mean(price)` `median(price)`
<dbl> <dbl>
1 3933. 2401
group_by()
group_by() is used to group rows of a data frame by one or more columns.
This helps in performing operations like summarizing or aggregating data by categories.
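The output below has one row per combination of cut and color. A sketch of a grouped summary that would produce it, assuming the mean price was computed per group:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Group by two columns, then compute one summary value per group
mean_by_group <- diamonds %>%
  group_by(cut, color) %>%
  summarise(mean(price))
mean_by_group
```

By default, summarise() removes only the last grouping level, which is why the result is still grouped by cut. dplyr prints a message to that effect; the .groups argument of summarise() controls this behaviour.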
# A tibble: 35 × 3
# Groups: cut [5]
cut color `mean(price)`
<ord> <ord> <dbl>
1 Fair D 4291.
2 Fair E 3682.
3 Fair F 3827.
4 Fair G 4239.
5 Fair H 5136.
6 Fair I 4685.
7 Fair J 4976.
8 Good D 3405.
9 Good E 3424.
10 Good F 3496.
# ℹ 25 more rows
mutate()
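No example survives for this verb on the diamonds data, so here is a minimal sketch. The column name price_per_carat is a hypothetical choice for illustration.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Add a new column computed from existing ones; the other columns are kept
diamonds_ppc <- diamonds %>% mutate(price_per_carat = price / carat)
diamonds_ppc %>% select(carat, price, price_per_carat)
```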
arrange()
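The first output below shows the six most expensive diamonds in ascending order of price, and the second shows the whole table in descending order. A sketch of calls consistent with that, assuming price was the sorting column:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Ascending sort (the default); tail() shows the last six rows
by_price <- diamonds %>% arrange(price)
by_price %>% tail()

# Descending sort with desc()
by_price_desc <- diamonds %>% arrange(desc(price))
by_price_desc
```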
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
2 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
3 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
4 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
5 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
6 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
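The final output below comes from the built-in mtcars data and combines several verbs. A sketch of a pipeline that would reproduce it; the exact filter threshold is an assumption, since any cutoff that keeps only the cars with at least 21 mpg gives the same rows.

```r
library(dplyr)

efficient_cars <- mtcars %>%
  filter(mpg > 20) %>%                       # assumed threshold
  select(mpg, wt, hp) %>%                    # keep three columns
  mutate(mpg_per_weight = mpg / wt) %>%      # derived column
  arrange(desc(mpg))                         # highest mpg first
efficient_cars
```

Because mtcars is a plain data frame rather than a tibble, recent versions of dplyr keep its row names, which is why the car names appear in the printed result.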
                mpg    wt  hp mpg_per_weight
Toyota Corolla 33.9 1.835  65      18.474114
Fiat 128       32.4 2.200  66      14.727273
Honda Civic    30.4 1.615  52      18.823529
Lotus Europa   30.4 1.513 113      20.092531
Fiat X1-9      27.3 1.935  66      14.108527
Porsche 914-2  26.0 2.140  91      12.149533
Merc 240D      24.4 3.190  62       7.648903
Datsun 710     22.8 2.320  93       9.827586
Merc 230       22.8 3.150  95       7.238095
Toyota Corona  21.5 2.465  97       8.722110
Hornet 4 Drive 21.4 3.215 110       6.656299
Volvo 142E     21.4 2.780 109       7.697842
Mazda RX4      21.0 2.620 110       8.015267
Mazda RX4 Wag  21.0 2.875 110       7.304348